Abstract

We present an iterative procedure to build 
a Chinese language model (LM). We seg- 
ment Chinese text into words based on a 
word-based Chinese language model. How- 
ever, the construction of a Chinese LM it- 
self requires word boundaries. To get out 
of the chicken-and-egg problem, we propose 
an iterative procedure that alternates two 
operations: segmenting text into words and 
building an LM. Starting with an initial 
segmented corpus and an LM based upon 
it, we use a Viterbi-like algorithm to segment
another set of data. Then, we build
an LM based on the second set and use the
resulting LM to segment the first corpus
again. The alternating procedure provides a
self-organized way for the segmenter to detect
unseen words automatically and correct
segmentation errors. Our preliminary
experiment shows that the alternating
procedure not only improves the accuracy
of our segmentation, but also discovers
unseen words surprisingly well. The resulting
word-based LM has a perplexity of 188 for
a general Chinese corpus.



1 Introduction 



In statistical speech recognition (Bahl et al., 1983),
it is necessary to build a language model (LM) for
assigning probabilities to hypothesized sentences. The
LM is usually built by collecting statistics of words
over a large set of text data. While doing so is
straightforward for English, it is not trivial to collect
statistics for Chinese words since word boundaries
are not marked in written Chinese text. Chinese
is a morphosyllabic language (DeFrancis, 1984) in
that almost all Chinese characters represent a single
syllable and most Chinese characters are also
morphemes. Since a word can be multi-syllabic, it is
generally non-trivial to segment a Chinese sentence into
words (Wu and Tseng, 1993). Since segmentation is
a fundamental problem in Chinese information
processing, there is a large literature dealing with the
problem. Recent work includes (Sproat et al., 1994)
and (Wang et al., 199^). In this paper, we adopt a
statistical approach to segment Chinese text based
on an LM because of its autonomous nature and its
capability to handle unseen words.

As far as speech recognition is concerned, what is
needed is a model to assign a probability to a string
of characters. One may argue that we could bypass
the segmentation problem by building a
character-based LM. However, we strongly believe that a
word-based LM would be better than a
character-based one (footnote 1). In addition to speech recognition, the
use of word-based models would have value in
information retrieval and other language processing
applications.

If word boundaries are given, all established techniques
can be exploited to construct an LM (Jelinek
et al., 1992) just as is done for English. Therefore,
segmentation is a key issue in building the Chinese
LM. In this paper, we propose a segmentation algorithm
based on an LM. Since building an LM itself
needs word boundaries, this is a chicken-and-egg
problem. To get out of this, we propose an iterative
procedure that alternates between the segmentation
of Chinese text and the construction of the LM. Our
preliminary experiments show that the iterative procedure
is able to improve the segmentation accuracy
and, more importantly, that it can detect unseen words
automatically.

1 A character-based trigram model has a perplexity
of 46 per character or $46^2$ per word (a Chinese word has
an average length of 2 characters), while a word-based
trigram model has a perplexity of 188 on the same set of
data. While the comparison would be fairer using a 5-gram
character model, we expect that the word model would have
a lower perplexity as long as the coverage is high.
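
The per-word figure quoted for the character-based trigram model can be checked with a short worked step. This is a sketch under the simplifying assumption that every word has exactly the average length of 2 characters; the symbols $PP_c$ (per-character perplexity) and $PP_w$ (per-word perplexity) are introduced here for illustration. Spreading the same total log-probability over half as many units squares the perplexity:

\begin{equation}
PP_w = PP_c^{\,2} = 46^2 = 2116
\end{equation}

which is roughly eleven times the word-based trigram perplexity of 188.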

In section 2, the Viterbi-like segmentation algorithm
based on an LM is described. Then in section 3
we discuss the alternating procedure
of segmentation and building Chinese LMs.
We test the segmentation algorithm and the alternating
procedure and the results are reported in section 4.
Finally, the work is summarized in section 5.

2 Segmentation based on LM

In this section, we assume there is a word-based Chi- 
nese LM at our disposal so that we are able to com- 
pute the probability of a sentence (with word bound- 
aries). We use a Viterbi-like segmentation algorithm 
based on the LM to segment texts. 

Denote a sentence $S$ by $C_1 C_2 \cdots C_{n-1} C_n$, where
each $C_i$ ($1 \le i \le n$) is a Chinese character. To segment
a sentence into words is to group these characters
into words, i.e.

\begin{align}
S &= C_1 C_2 \cdots C_{n-1} C_n \tag{1} \\
  &= (C_1 \cdots C_{x_1})(C_{x_1+1} \cdots C_{x_2}) \tag{2} \\
  &\quad \cdots (C_{x_{m-1}+1} \cdots C_{x_m}) \tag{3} \\
  &= w_1 w_2 \cdots w_m \tag{4}
\end{align}

where $x_k$ is the index of the last character in the $k$-th
word $w_k$, i.e. $w_k = C_{x_{k-1}+1} \cdots C_{x_k}$ ($k = 1, 2, \cdots, m$),
and of course, $x_0 = 0$, $x_m = n$.

Note that a segmentation of the sentence $S$ can
be uniquely represented by an integer sequence
$x_1, \cdots, x_m$, so we will denote a segmentation by its
corresponding integer sequence hereafter. Let

\begin{equation}
G(S) = \{(x_1 \cdots x_m) : 1 \le x_1 < \cdots < x_m,\ m \le n\} \tag{5}
\end{equation}

be the set of all possible segmentations of sentence
$S$. Suppose a word-based LM is given; then for a
segmentation $g(S) = (x_1 \cdots x_m) \in G(S)$, we can
assign a score to $g(S)$ by



\begin{align}
L(g(S)) &= \log P_g(w_1 \cdots w_m) \tag{6} \\
        &= \sum_{i=1}^{m} \log P_g(w_i \mid h_i) \tag{7}
\end{align}

where $w_j = C_{x_{j-1}+1} \cdots C_{x_j}$ ($j = 1, 2, \cdots, m$), and $h_i$
is understood as the history words $w_1 \cdots w_{i-1}$. In
this paper the trigram model (Jelinek et al., 1992) is
used and therefore $h_i = w_{i-2} w_{i-1}$.

Among all possible segmentations, we pick the one
$g^*$ with the highest score as our result. That is,

\begin{align}
g^* &= \arg\max_{g \in G(S)} L(g(S)) \tag{8} \\
    &= \arg\max_{g \in G(S)} \log P_g(w_1 \cdots w_m) \tag{9}
\end{align}



Note the score depends on segmentation $g$ and this
is emphasized by the subscript in (9). The optimal
segmentation $g^*$ can be obtained by dynamic programming.
With a slight abuse of notation, let $L(k)$
be the max accumulated score for the first $k$ characters.
$L(k)$ is defined for $k = 1, 2, \cdots, n$ with $L(1) =$
and $L(g^*) = L(n)$. Given $\{L(i) : 1 \le i \le k-1\}$,
$L(k)$ can be computed recursively as follows:

\begin{equation}
L(k) = \max_{1 \le i \le k-1} [L(i) + \log P(C_{i+1} \cdots C_k \mid h_i)] \tag{10}
\end{equation}

where $h_i$ is the history words ending with the $i$-th
character $C_i$. At the end of the recursion, we need
to trace back to find the segmentation points. Therefore,
it is necessary to record the segmentation points
in (10).

Let $p(k)$ be the index of the last character in the
preceding word. Then

\begin{equation}
p(k) = \arg\max_{1 \le i \le k-1} [L(i) + \log P(C_{i+1} \cdots C_k \mid h_i)] \tag{11}
\end{equation}

that is, $C_{p(k)+1} \cdots C_k$ comprises the last word of the
optimal segmentation up to the $k$-th character.

A typical example of a six-character sentence is
shown in Table 1. Since $p(6) = 4$, we know the last
word in the optimal segmentation is $C_5 C_6$. Since
$p(4) = 3$, the second last word is $C_4$. And so on.
The optimal segmentation for this sentence is
$(C_1)(C_2 C_3)(C_4)(C_5 C_6)$.



Table 1: A segmentation example

chars   C_1   C_2   C_3   C_4   C_5   C_6
k       1     2     3     4     5     6
p(k)          1     1     3     3     4

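
The trace-back over Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name `trace_back` is hypothetical, and since $p(1)$ is blank in the table it is taken to be 0 here, which is an assumption.

```python
def trace_back(p, n):
    """Recover words as 1-based inclusive (start, end) character spans
    by following the back-pointers p(k) from the last character."""
    spans = []
    k = n
    while k > 0:
        i = p[k]                 # last character of the preceding word
        spans.append((i + 1, k))
        k = i
    spans.reverse()
    return spans


# Back-pointers from Table 1; p(1) = 0 is an assumed value for the blank cell.
p = {1: 0, 2: 1, 3: 1, 4: 3, 5: 3, 6: 4}
print(trace_back(p, 6))  # [(1, 1), (2, 3), (4, 4), (5, 6)]
```

The spans correspond to $(C_1)(C_2 C_3)(C_4)(C_5 C_6)$, the segmentation given in the text.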



The searches in (10) and (11) are in general time-consuming.
Since long words are very rare in Chinese
(94% of words have three or fewer characters
(Wu and Tseng, 1993)), it won't hurt at all to limit
the search space in (10) and (11) by putting an upper
bound (say, 10) on the length of the word being
explored, i.e., to impose the constraint $i \ge \max\{1, k-d\}$ in
(10) and (11), where $d$ is the upper bound of Chinese
word length. This will speed up the dynamic programming
significantly for long sentences.
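
The recursion (10), the back-pointers (11) and the length bound $d$ can be put together in a short Python sketch. It is a simplified illustration rather than the system used in the experiments: the score ignores the history $h_i$ (a unigram model instead of the trigram), the base case is taken as a score of 0 for the empty prefix with $i$ starting from 0, and the names `segment`, `toy_log_prob` and the probabilities in `TOY` are all hypothetical.

```python
import math


def segment(chars, log_prob, max_len=10):
    """Viterbi-like search for the highest-scoring segmentation of `chars`.

    log_prob(word) returns the log-probability of one candidate word;
    max_len plays the role of the upper bound d on word length.
    """
    n = len(chars)
    best = [-math.inf] * (n + 1)  # best[k]: max accumulated score of first k chars
    back = [0] * (n + 1)          # back[k]: end index of the preceding word, p(k)
    best[0] = 0.0                 # assumed base case: empty prefix scores 0
    for k in range(1, n + 1):
        for i in range(max(0, k - max_len), k):
            score = best[i] + log_prob(chars[i:k])
            if score > best[k]:
                best[k], back[k] = score, i
    words, k = [], n
    while k > 0:                  # trace back through the recorded split points
        words.append(chars[back[k]:k])
        k = back[k]
    return words[::-1], best[n]


# Toy unigram probabilities (made up for illustration).
TOY = {"ab": 0.4, "c": 0.3, "a": 0.1, "b": 0.1, "bc": 0.1}


def toy_log_prob(word):
    # Unseen strings get a small probability that decays with length (assumption).
    return math.log(TOY.get(word, 1e-4 ** len(word)))


words, score = segment("abc", toy_log_prob)
print(words)  # ['ab', 'c']
```

With the bound in place the inner loop runs at most `max_len` times per character, so the cost grows linearly with sentence length instead of quadratically.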

It is worth pointing out that the algorithm in
(10) and (11) could pick an unseen word (i.e., a word
not included in the vocabulary on which the LM is
built) in the optimal segmentation provided the LM
assigns proper probabilities to unseen words. This is
the beauty of the algorithm: it is able to handle
unseen words automatically.



3 Iterative procedure to build LM 



In the previous section, we assumed there exists a
Chinese word LM at our disposal. However, this is
not true in reality. In this section, we discuss an iterative
procedure that builds an LM and automatically
appends the unseen words to the current vocabulary.

The procedure first splits the data into two parts,
sets $T_1$ and $T_2$. We start from an initial segmentation
of the set $T_1$. This can be done, for instance,
by a simple greedy algorithm described in (Sproat
et al., 1994). With the segmented $T_1$, we construct
an $LM_i$ on it. Then we segment the set $T_2$ by using
$LM_i$ and the algorithm described in section 2.
At the same time, we keep a counter for each unseen
word in optimal segmentations and increment the
counter whenever its associated word appears in an
optimal segmentation. This gives us a measure to
tell whether an unseen word is an accidental character
string or a real word not included in our vocabulary.
The higher a counter is, the more likely it is
a word. After segmenting the set $T_2$, we add to our
vocabulary all unseen words whose counters are greater
than a threshold $c$. Then we use the augmented
vocabulary and construct another $LM_{i+1}$ using the
segmented $T_2$. The pattern is clear now: $LM_{i+1}$ is
used to segment the set $T_1$ again and the vocabulary
is further augmented.
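
The alternation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: `alternate`, `raw_sets`, `build_lm` and `segment_fn` are hypothetical names, the LM builder and the segmenter are supplied by the caller, the two sets are indexed so that $T_2$ is segmented on odd rounds and $T_1$ on even rounds, and the counters are assumed to persist across rounds.

```python
from collections import Counter


def alternate(raw_sets, initial_segmentation, vocab, build_lm, segment_fn, c, rounds):
    """Alternate between segmenting one half of the data and rebuilding the LM.

    raw_sets: (T1, T2), two lists of unsegmented sentences.
    initial_segmentation: word lists for T1 from some initial segmenter.
    build_lm(sentences, vocab) -> lm and segment_fn(sentence, lm) -> list of
    words are caller-supplied functions (hypothetical interfaces).
    c: counter threshold for promoting an unseen word into the vocabulary.
    """
    vocab = set(vocab)
    lm = build_lm(initial_segmentation, vocab)
    counts = Counter()
    for i in range(1, rounds + 1):
        target = raw_sets[i % 2]  # T2 on odd rounds, T1 on even rounds
        segmented = [segment_fn(s, lm) for s in target]
        for words in segmented:
            # Count every occurrence of a word that is not yet in the vocabulary.
            counts.update(w for w in words if w not in vocab)
        vocab |= {w for w, n in counts.items() if n > c}
        lm = build_lm(segmented, vocab)
    return lm, vocab
```

Passing the segmenter and the LM builder as arguments keeps the loop independent of any particular model, so the same skeleton works with a unigram or a trigram LM.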

To be more precise, the procedure can be written
in pseudo code as follows.

Step 0: Initially segment the set $T_1$.
        Construct an LM $LM_0$ with an initial vocabulary $V_0$.
        Set i = 1.

Step 1: Let j = i mod 2;
        For each sentence S in the set $T_j$, do
        1.1 segment it using $LM_{i-1}$.
        1.2 for each unseen word in the optimal segmentation,
            increment its counter by the number of times it
            appears in the optimal segmentation.

Step 2: Let A = the set of unseen words with
        counter greater than c.
        Set $V_i = V_{i-1} \cup A$.
        Construct another $LM_i$ using the segmented set
        $T_j$ and the vocabulary $V_i$.

Step 3: i = i + 1 and go to step 1.

Unseen words, most of which are proper nouns,
pose a serious problem to Chinese text segmentation.
In (Sproat et al., 1994) a class-based model was
proposed to identify personal names. In (Wang et
al., 199^), a title-driven method was used to identify
personal names. The iterative procedure proposed
here provides a self-organized way to detect unseen
words, including proper nouns. The advantage is
that it needs little human intervention. The procedure
also provides a chance for us to correct segmentation
errors.

4 Experiments and Evaluation

4.1 Segmentation Accuracy

Our first attempt is to see how accurate the segmentation
algorithm proposed in section 2 is. To this
end, we split the whole data set into two parts, half
for building LMs and half reserved for testing. The
trigram model used in this experiment is the standard
deleted interpolation model described in (Jelinek
et al., 1992) with a vocabulary of 20K words.

Since we lack an objective criterion to measure
the accuracy of a segmentation system, we ask three
native speakers to segment manually 100 sentences
picked randomly from the test set and compare
them with segmentations by machine. The result is
summarized in Table 2, where ORG stands for the original
segmentation, P1, P2 and P3 for three human
subjects, and TRI and UNI stand for the segmentations
generated by trigram LM and unigram LM
respectively. The number reported here is the arithmetic
average of recall and precision, as was used in
(Sproat et al., 1994), i.e., $\frac{1}{2}\left(\frac{n_c}{n_1} + \frac{n_c}{n_2}\right)$, where $n_c$
is the number of common words in both segmentations,
and $n_1$ and $n_2$ are the number of words in each of
the segmentations.



Table 2: Segmentation Accuracy

        ORG    P1     P2     P3     TRI    UNI
ORG                                 94.2   91.2
P1      85.9                        85.3   87.4
P2      79.1   90.9                 80.1   82.2
P3      87.4   85.7   82.2          85.6   85.7
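
The agreement measure behind Table 2, the arithmetic average of $n_c/n_1$ and $n_c/n_2$, can be sketched for a single sentence as follows. The sketch assumes that a "common word" is a word occupying the same character span in both segmentations; the function names `word_spans` and `agreement` are hypothetical, and how the counts were pooled over the 100 test sentences is not specified here.

```python
def word_spans(words):
    """Map a segmentation (list of words) to the set of (start, end)
    character spans that its words occupy."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def agreement(seg1, seg2):
    """Arithmetic average of n_c/n_1 and n_c/n_2 for two segmentations
    of the same sentence."""
    s1, s2 = word_spans(seg1), word_spans(seg2)
    n_c = len(s1 & s2)  # words with identical spans in both segmentations
    return 0.5 * (n_c / len(s1) + n_c / len(s2))


# Two segmentations of the same six-character string (placeholder letters).
print(agreement(["A", "BC", "DEF"], ["A", "BC", "D", "E", "F"]))
```

Here two words are shared, so the score is $\frac{1}{2}(2/3 + 2/5) = 8/15$. Comparing spans rather than word strings avoids counting a repeated word as a match when it occurs at different positions.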



We can make a few remarks about the result
in Table 2. First of all, it is interesting to note
that the agreement of segmentations among human
subjects is roughly at the same level as that between
human subjects and machine. This confirms
what was reported in (Sproat et al., 1994). The major
disagreement for human subjects comes from compound
words, phrases and suffixes. Since we don't
give any specific instructions to human subjects,
one of them tends to group phrases consistently
as words because he was implicitly using semantics
as his segmentation criterion. For example, he
segments the sentence dao4 jia1 li2 chi1 dun4
fan4 (footnote 3; see Table 3) as two words dao4 jia1 li2 (go
home) and chi1 dun4 fan4 (have a meal) because
the two "words" are clearly two semantic units. The
other two subjects and machine segment it as dao4
/ jia1 li2 / chi1 / dun4 / fan4.

Table 3: Segmentation of phrases

Chinese   dao4   jia1 li2   chi1   dun4 fan4
Meaning   go     home       eat    a meal

Chinese has very limited morphology (Spencer,
1991) in that most grammatical concepts are conveyed
by separate words and not by morphological
processes. The limited morphology includes some
ending morphemes to represent tenses of verbs, and
this is another source of disagreement. For example,
for the partial sentence zuo4 wan2 le, where
le functions as labeling the verb zuo4 wan2 as "perfect"
tense, some subjects tend to segment it as two
words zuo4 wan2 / le while the others treat it as one
single word.

Second, the agreement of each of the subjects with
either the original, trigram, or unigram segmentation
is quite high (see columns 2, 6, and 7 in Table 2)
and appears to be specific to the subject.

Third, it seems puzzling that the trigram LM
agrees with the original segmentation better than a
unigram model, but gives a worse result when compared
with manual segmentations. However, since
the LMs are trained using the presegmented data,
the trigram model tends to keep the original segmentation
because it takes the preceding two words into
account, while the unigram model is freer
to deviate from the original segmentation. In other
words, if trained with "cleanly" segmented data, a
trigram model is more likely to produce a better
segmentation since it tends to preserve the nature of
the training data.

4.2 Experiment of the iterative procedure

In addition to the 5 million characters of segmented
text, we had unsegmented data from various sources
reaching about 13 million characters. We applied
our iterative algorithm to that corpus.

Table 4 shows the figure of merit of the resulting
segmentation of the 100 sentence test set described
earlier. After one iteration, the agreement with
the original segmentation decreased by 3 percentage
points, while the agreement with the human segmentation
increased by less than one percentage point.
We ran our computation-intensive procedure for one
iteration only. The results indicate that the impact
on segmentation accuracy would be small. However,
the new unsegmented corpus is a good source of
automatically discovered words. Twenty examples picked
randomly from about 1500 unseen words are shown
in Table 5. Sixteen of them are reasonably good words
and are listed with their translated meanings. The
problematic words are marked with "?".

Table 4: Segmentation accuracy after one iteration

2 The corpus has about 5 million characters and is
coarsely pre-segmented.

3 Here we use Pin Yin followed by its tone to represent
a character.



Table 5: Examples of unseen words

Pin Yin                   Meaning
kui2 er2                  last name of former US vice president
he2 shi4 lu4 yin1 dai4    cassette of audio tape
shou2 dao3                (abbr) protect (the) island
ren4 zhong4               first name or part of a phrase
ji4 jian3                 (abbr) discipline monitoring
zi4 hai4                  ?
shuang1 bao3              double guarantee
ji4 dong1                 (abbr) Eastern He Bei province
zi3 jiao1                 purple glue
xiao1 long2 shi2          personal name
li4 bo4 hai3              ?
du4 shan1                 ?
shang1 ban4               (abbr) commercial oriented
liu6 hai4                 six (types of) harms
sa4 he4 le4               translated name
kuai4 xun4                fast news
cheng4 jing3              train cop
huang2 du2                yellow poison
ba3 lian2                 ?
he2 dao3                  a (biological) jargon



4.3 Perplexity of the language model 

After each segmentation, an interpolated trigram
model is built, and an independent test set with
2.5 million characters is segmented and then used
to measure the quality of the model. We got a perplexity
of 188 for a vocabulary of 80K words, and the
alternating procedure has little impact on the perplexity.
This can be explained by the fact that the
change of segmentation is very small (which is reflected
in Table 4) and the addition of unseen words (1.5K)
to the vocabulary is also
too small to affect the overall perplexity. The merit
of the alternating procedure is probably its ability
to detect unseen words.
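
For reference, test-set perplexity is the exponential of the negative average log-probability that the model assigns to each word of the segmented test set. A minimal sketch, assuming natural-log probabilities have already been obtained from some LM (the function name `perplexity` is hypothetical):

```python
import math


def perplexity(log_probs):
    """Perplexity from the natural-log probabilities of each predicted word."""
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 1/4 to every word has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

Because the average is taken per word, the value depends on how the test set is segmented, which is why the test set is segmented before the measurement.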

5 Conclusion 

In this paper, we present an iterative procedure
to build a Chinese language model (LM). We segment
Chinese text into words based on a word-based
Chinese language model. However, the construction of
a Chinese LM itself requires word boundaries. To
get out of the chicken-and-egg problem, we propose an
iterative procedure that alternates two operations:
segmenting text into words and building an LM.
Starting with an initial segmented corpus and an
LM based upon it, we use a Viterbi-like algorithm to
segment another set of data. Then we build an LM
based on the second set and use the LM to segment
the first corpus again. The alternating procedure
provides a self-organized way for the segmenter
to detect unseen words automatically and correct
segmentation errors. Our preliminary experiment
shows that the alternating procedure not only improves
the accuracy of our segmentation, but also
discovers unseen words surprisingly well. We get a
perplexity of 188 for a general Chinese corpus with 2.5
million characters.